{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Explore Your Dataset With Python and Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prerequisites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installing Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```console\n",
    "$ python -m pip install pandas\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```console\n",
    "$ conda install pandas\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Hello Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "download_url = \"https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv\"\n",
    "target_csv_path = \"nba_all_elo.csv\"\n",
    "\n",
    "response = requests.get(download_url)\n",
    "response.raise_for_status()\n",
    "with open(target_csv_path, \"wb\") as f:\n",
    "    f.write(response.content)\n",
    "print(\"Download ready.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "nba = pd.read_csv(\"nba_all_elo.csv\")\n",
    "type(nba)\n",
    "# Expected:\n",
    "# <class 'pandas.core.frame.DataFrame'>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(nba)\n",
    "# Expected:\n",
    "# 126314"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.shape\n",
    "# Expected:\n",
    "# (126314, 23)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option(\"display.max.columns\", None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option(\"display.precision\", 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Can you print the last 3 lines of our `DataFrame`?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting to Know Your Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Displaying Data Types With `.info()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Showing Basics Statistics With `.describe()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "nba.describe(include=np.object)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A Mini Data Exploration With Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[\"team_id\"].value_counts()\n",
    "# Expected:\n",
    "# BOS    5997\n",
    "# NYK    5769\n",
    "# LAL    5078\n",
    "\n",
    "# SDS      11"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[\"fran_id\"].value_counts()\n",
    "# Expected:\n",
    "# Name: team_id, Length: 104, dtype: int64\n",
    "# Lakers          6024\n",
    "# Celtics         5997\n",
    "# Knicks          5769\n",
    "\n",
    "# Huskies           60\n",
    "# Name: fran_id, dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.loc[nba[\"fran_id\"] == \"Lakers\", \"team_id\"].value_counts()\n",
    "# Expected:\n",
    "# LAL    5078\n",
    "# MNL     946\n",
    "# Name: team_id, dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.loc[nba[\"team_id\"] == \"MNL\", \"date_game\"].min()\n",
    "# Expected:\n",
    "# '1/1/1949'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.loc[nba[\"team_id\"] == \"MNL\", \"date_game\"].max()\n",
    "# Expected:\n",
    "# '4/9/1959'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.loc[nba[\"team_id\"] == \"MNL\", \"date_game\"].agg((\"min\", \"max\"))\n",
    "# Expected:\n",
    "# min    1/1/1949\n",
    "# max    4/9/1959\n",
    "# Name: date_game, dtype: object"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Find out how many points the Boston Celtics have scored during all its matches contained in this dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting to Know Pandas' Data Structures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `Series`: The Basic Building Block in Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "revenues = pd.Series([5555, 7000, 1980])\n",
    "revenues\n",
    "# Expected:\n",
    "# 0    5555\n",
    "# 1    7000\n",
    "# 2    1980\n",
    "# dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "revenues.values\n",
    "# Expected:\n",
    "# array([5555, 7000, 1980])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "revenues.index\n",
    "# Expected:\n",
    "# RangeIndex(start=0, stop=3, step=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(revenues.values)\n",
    "# Expected:\n",
    "# <class 'numpy.ndarray'>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues = pd.Series(\n",
    "    [4200, 8000, 6500],\n",
    "    index=[\"Amsterdam\", \"Toronto\", \"Tokyo\"]\n",
    ")\n",
    "city_revenues\n",
    "# Expected:\n",
    "# Amsterdam    4200\n",
    "# Toronto      8000\n",
    "# Tokyo        6500\n",
    "# dtype: int64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `Pandas Series` vs Built-In Python Data Structures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_employee_count = pd.Series({\"Amsterdam\": 5, \"Tokyo\": 8})\n",
    "city_employee_count\n",
    "# Expected:\n",
    "# Amsterdam    5\n",
    "# Tokyo        8\n",
    "# dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_employee_count.keys()\n",
    "# Expected:\n",
    "# Index(['Amsterdam', 'Tokyo'], dtype='object')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"Tokyo\" in city_employee_count\n",
    "# Expected:\n",
    "# True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"New York\" in city_employee_count\n",
    "# Expected:\n",
    "# False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `DataFrame`: Pandas' Most Popular Data Structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data = pd.DataFrame({\n",
    "    \"revenue\": city_revenues,\n",
    "    \"employee_count\": city_employee_count\n",
    "})\n",
    "city_data\n",
    "# Expected:\n",
    "# revenue  employee_count\n",
    "# Amsterdam     4200             5.0\n",
    "# Tokyo         6500             8.0\n",
    "# Toronto       8000             NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.index\n",
    "# Expected:\n",
    "# Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.values\n",
    "# Expected:\n",
    "# array([[4.2e+03, 5.0e+00],\n",
    "# [6.5e+03, 8.0e+00],\n",
    "# [8.0e+03,     nan]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.axes\n",
    "# Expected:\n",
    "# [Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object'),\n",
    "# Index(['revenue', 'employee_count'], dtype='object')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.axes[0]\n",
    "# Expected:\n",
    "# Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.axes[1]\n",
    "# Expected:\n",
    "# Index(['revenue', 'employee_count'], dtype='object')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.keys()\n",
    "# Expected:\n",
    "# Index(['revenue', 'employee_count'], dtype='object')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"Amsterdam\" in city_data\n",
    "# Expected:\n",
    "# False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"revenue\" in city_data\n",
    "# Expected:\n",
    "# True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Display the index and the axes of the `nba` dataset.  \n",
    "Check whether it has a column \"points\". Or was it called \"pts\"?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Accessing Elements for Python `list` and Pandas `Series`: Similarities and Differences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Benefits and Limitations of the Python Indexing Operator `[]`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues\n",
    "# Expected:\n",
    "# Amsterdam    4200\n",
    "# Toronto      8000\n",
    "# Tokyo        6500\n",
    "# dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues[\"Toronto\"]\n",
    "# Expected:\n",
    "# 8000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues[1]\n",
    "# Expected:\n",
    "# 8000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues[-1]\n",
    "# Expected:\n",
    "# 6500"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues[1:]\n",
    "# Expected:\n",
    "# Toronto    8000\n",
    "# Tokyo      6500\n",
    "# dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues[\"Toronto\":]\n",
    "# Expected:\n",
    "# Toronto    8000\n",
    "# Tokyo      6500\n",
    "# dtype: int64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing `Series`' Elements via `.loc` and `.iloc`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors = pd.Series(\n",
    "    [\"red\", \"purple\", \"blue\", \"green\", \"yellow\"],\n",
    "    index=[1, 2, 3, 5, 8]\n",
    ")\n",
    "colors\n",
    "# Expected:\n",
    "# 1       red\n",
    "# 2    purple\n",
    "# 3      blue\n",
    "# 5     green\n",
    "# 8    yellow\n",
    "# dtype: object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors.loc[1]\n",
    "# Expected:\n",
    "# 'red'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors.iloc[1]\n",
    "# Expected:\n",
    "# 'purple'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Return the elements with the implicit index: 1, 2\n",
    "colors.iloc[1:3]\n",
    "# Expected:\n",
    "# 2    purple\n",
    "# 3      blue\n",
    "# dtype: object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Return the elements with the explicit index between 3 and 8\n",
    "colors.loc[3:8]\n",
    "# Expected:\n",
    "# 3      blue\n",
    "# 5     green\n",
    "# 8    yellow\n",
    "# dtype: object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors.iloc[-2]\n",
    "# Expected:\n",
    "# 'green'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Accessing the Elements of a `DataFrame`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing Columns of a `DataFrame`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data[\"revenue\"]\n",
    "# Expected:\n",
    "# Amsterdam    4200\n",
    "# Tokyo        6500\n",
    "# Toronto      8000\n",
    "# Name: revenue, dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(city_data[\"revenue\"])\n",
    "# Expected:\n",
    "# pandas.core.series.Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.revenue\n",
    "# Expected:\n",
    "# Amsterdam    4200\n",
    "# Tokyo        6500\n",
    "# Toronto      8000\n",
    "# Name: revenue, dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "toys = pd.DataFrame([\n",
    "    {\"name\": \"ball\", \"shape\": \"sphere\"},\n",
    "    {\"name\": \"Rubik's cube\", \"shape\": \"cube\"}\n",
    "])\n",
    "toys[\"shape\"]\n",
    "# Expected:\n",
    "# 0    sphere\n",
    "# 1      cube\n",
    "# Name: shape, dtype: object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "toys.shape\n",
    "# Expected:\n",
    "# (2, 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing Rows of a `DataFrame` With `.loc` and `.iloc`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.loc[\"Amsterdam\"]\n",
    "# Expected:\n",
    "# revenue           4200.0\n",
    "# employee_count       5.0\n",
    "# Name: Amsterdam, dtype: float64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.loc[\"Tokyo\": \"Toronto\"]\n",
    "# Expected:\n",
    "# revenue employee_count\n",
    "# Tokyo   6500    8.0\n",
    "# Toronto 8000    NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.iloc[1]\n",
    "# Expected:\n",
    "# revenue           6500.0\n",
    "# employee_count       8.0\n",
    "# Name: Tokyo, dtype: float64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Display the 2nd last row of the `nba` dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing a Subset of Rows and Columns in a `DataFrame` With `.loc` and `.iloc`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_data.loc[\"Amsterdam\": \"Tokyo\", \"revenue\"]\n",
    "# Expected:\n",
    "# Amsterdam    4200\n",
    "# Tokyo        6500\n",
    "# Name: revenue, dtype: int64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Have a look at the games between the labels 5555 and 5559. We're interested only in the names of the teams and the scores."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Queries With Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "current_decade = nba[nba[\"year_id\"] > 2010]\n",
    "current_decade.shape\n",
    "# Expected:\n",
    "# (12658, 23)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "games_with_notes = nba[nba[\"notes\"].notnull()]\n",
    "games_with_notes.shape\n",
    "# Expected:\n",
    "# (5424, 23)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ers = nba[nba[\"fran_id\"].str.endswith(\"ers\")]\n",
    "ers.shape\n",
    "# Expected:\n",
    "# (27797, 23)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[\n",
    "    (nba[\"_iscopy\"] == 0) &\n",
    "    (nba[\"pts\"] > 100) &\n",
    "    (nba[\"opp_pts\"] > 100) &\n",
    "    (nba[\"team_id\"] == \"BLB\")\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "In the spring of 1992, both teams from Los Angeles had to play a home game at another court. Find those 2 games. (Both teams have an ID starting with \"LA\".)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aggregations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aggregations for `Series`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues.sum()\n",
    "# Expected:\n",
    "# 18700"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_revenues.max()\n",
    "# Expected:\n",
    "# 8000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aggregations for `DataFrame`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "points = nba[\"pts\"]\n",
    "type(points)\n",
    "# Expected:\n",
    "# <class 'pandas.core.series.Series'>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "points.sum()\n",
    "# Expected:\n",
    "# 12976235"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Grouping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.groupby(\"fran_id\", sort=False)[\"pts\"].sum()\n",
    "# Expected:\n",
    "# fran_id\n",
    "# Huskies           3995\n",
    "# Knicks          582497\n",
    "# Stags            20398\n",
    "# Falcons           3797\n",
    "# Capitols         22387"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[\n",
    "    (nba[\"fran_id\"] == \"Spurs\") &\n",
    "    (nba[\"year_id\"] > 2010)\n",
    "].groupby([\"year_id\", \"game_result\"])[\"game_id\"].count()\n",
    "# Expected:\n",
    "# year_id  game_result\n",
    "# 2011     L              25\n",
    "# W              63\n",
    "# 2012     L              20\n",
    "# W              60\n",
    "# 2013     L              30\n",
    "# W              73\n",
    "# 2014     L              27\n",
    "# W              78\n",
    "# 2015     L              31\n",
    "# W              58\n",
    "# Name: game_id, dtype: int64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Let's have a look at the Golden State Warriors' season in 2014-15 (year_id: 2015). How many wins and losses did they score during the regular season and the playoffs?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Manipulating Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = nba.copy()\n",
    "df.shape\n",
    "# Expected:\n",
    "# (126314, 23)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adding New Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"difference\"] = df.pts - df.opp_pts\n",
    "df.shape\n",
    "# Expected:\n",
    "# (126314, 24)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"difference\"].max()\n",
    "# Expected:\n",
    "# 68"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Renaming Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "renamed_df = df.rename(\n",
    "    columns={\"game_result\": \"result\", \"game_location\": \"location\"}\n",
    ")\n",
    "renamed_df.info()\n",
    "# Expected:\n",
    "# <class 'pandas.core.frame.DataFrame'>\n",
    "# RangeIndex: 126314 entries, 0 to 126313\n",
    "# Data columns (total 24 columns):\n",
    "# gameorder      126314 non-null int64\n",
    "\n",
    "# location       126314 non-null object\n",
    "# result         126314 non-null object\n",
    "# forecast       126314 non-null float64\n",
    "# notes          5424 non-null object\n",
    "# difference     126314 non-null int64\n",
    "# dtypes: float64(6), int64(8), object(10)\n",
    "# memory usage: 23.1+ MB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deleting Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape\n",
    "# Expected:\n",
    "# (126314, 24)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "elo_columns = [\"elo_i\", \"elo_n\", \"opp_elo_i\", \"opp_elo_n\"]\n",
    "df.drop(elo_columns, inplace=True, axis=1)\n",
    "df.shape\n",
    "# Expected:\n",
    "# (126314, 20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Changing the Data Type of Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"date_game\"] = pd.to_datetime(df[\"date_game\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"game_location\"].nunique()\n",
    "# Expected:\n",
    "# 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"game_location\"].value_counts()\n",
    "# Expected:\n",
    "# A    63138\n",
    "# H    63138\n",
    "# N       38\n",
    "# Name: game_location, dtype: int64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"game_location\"] = pd.Categorical(df[\"game_location\"])\n",
    "df[\"game_location\"].dtype\n",
    "# Expected:\n",
    "# CategoricalDtype(categories=['A', 'H', 'N'], ordered=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Find another column which has a too generic data type."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleaning Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Missing Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rows_without_missing_data = nba.dropna()\n",
    "rows_without_missing_data.shape\n",
    "# Expected:\n",
    "# (5424, 23)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_without_missing_columns = nba.dropna(axis=1)\n",
    "data_without_missing_columns.shape\n",
    "# Expected:\n",
    "# (126314, 22)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_with_default_notes = nba.copy()\n",
    "data_with_default_notes[\"notes\"].fillna(\n",
    "    value=\"no notes at all\",\n",
    "    inplace=True\n",
    ")\n",
    "data_with_default_notes[\"notes\"].describe()\n",
    "# Expected:\n",
    "# count              126314\n",
    "# unique                232\n",
    "# top       no notes at all\n",
    "# freq               120890\n",
    "# Name: notes, dtype: object"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Invalid Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[nba[\"pts\"] == 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inconsistencies Between Values in Different Columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[(nba[\"pts\"] > nba[\"opp_pts\"]) & (nba[\"game_result\"] != 'W')].empty\n",
    "# Expected:\n",
    "# True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[(nba[\"pts\"] < nba[\"opp_pts\"]) & (nba[\"game_result\"] != 'L')].empty\n",
    "# Expected:\n",
    "# True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working With Multiple Datasets in Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining Data Based on the Index With `concat()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "further_city_data = pd.DataFrame(\n",
    "    {\"revenue\": [7000, 3400], \"employee_count\":[2, 2]},\n",
    "    index=[\"New York\", \"Barcelona\"]\n",
    ")\n",
    "all_city_data = pd.concat([city_data, further_city_data], sort=False)\n",
    "all_city_data\n",
    "# Expected:\n",
    "# Amsterdam   4200    5.0\n",
    "# Tokyo       6500    8.0\n",
    "# Toronto     8000    NaN\n",
    "# New York    7000    2.0\n",
    "# Barcelona   3400    2.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_countries = pd.DataFrame({\n",
    "    \"country\": [\"Holland\", \"Japan\", \"Holland\", \"Canada\", \"Spain\"],\n",
    "    \"capital\": [1, 1, 0, 0, 0]},\n",
    "    index=[\"Amsterdam\", \"Tokyo\", \"Rotterdam\", \"Toronto\", \"Barcelona\"]\n",
    ")\n",
    "cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)\n",
    "cities\n",
    "# Expected:\n",
    "# revenue  employee_count  country  capital\n",
    "# Amsterdam   4200.0             5.0  Holland      1.0\n",
    "# Tokyo       6500.0             8.0    Japan      1.0\n",
    "# Toronto     8000.0             NaN   Canada      0.0\n",
    "# New York    7000.0             2.0      NaN      NaN\n",
    "# Barcelona   3400.0             2.0    Spain      0.0\n",
    "# Rotterdam      NaN             NaN  Holland      0.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.concat([all_city_data, city_countries], axis=1, join=\"inner\")\n",
    "# Expected:\n",
    "# revenue  employee_count  country  capital\n",
    "# Amsterdam     4200             5.0  Holland        1\n",
    "# Tokyo         6500             8.0    Japan        1\n",
    "# Toronto       8000             NaN   Canada        0\n",
    "# Barcelona     3400             2.0    Spain        0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining Data Based on Arbitrary Columns With `merge()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries = pd.DataFrame({\n",
    "    \"population_millions\": [17, 127, 37],\n",
    "    \"continent\": [\"Europe\", \"Asia\", \"North America\"]\n",
    "}, index= [\"Holland\", \"Japan\", \"Canada\"])\n",
    "pd.merge(cities, countries, left_on=\"country\", right_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.merge(\n",
    "    cities,\n",
    "    countries,\n",
    "    left_on=\"country\",\n",
    "    right_index=True,\n",
    "    how=\"left\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing Your Pandas DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[nba[\"fran_id\"] == \"Knicks\"].groupby(\"year_id\")[\"pts\"].sum().plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nba[\"fran_id\"].value_counts().head(10).plot(kind=\"bar\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "In 2013, the Miami Heat won the championship. Create a pie plot showing the count of their wins and losses during that season."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion and Further Resources"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "argv": [
    "/home/reka/anaconda3/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
   ],
   "display_name": "Python 3",
   "env": {},
   "interrupt_mode": "signal",
   "language": "python",
   "metadata": {},
   "name": "python3"
  },
  "language": "python"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
